Show the code
import pandas as pd
import numpy as np
from lets_plot import *
LetsPlot.setup_html(isolated_frame=True)Course DS 250
Aidan Pfunder
A SHORT (2-3 SENTENCES) PARAGRAPH THAT DESCRIBES KEY INSIGHTS TAKEN FROM METRICS IN THE PROJECT RESULTS THINK TOP OR MOST IMPORTANT RESULTS. (Note: this is not a summary of the project, but a summary of the results.)
A Client has requested this analysis and this is your one shot of what you would say to your boss in a 2 min elevator ride before he takes your report and hands it to the client.
Fix all of the varied missing data types in the data to be consistent (all missing values should be displayed as “NaN”). In your report include one record example (one row) from your new data, in the raw JSON format. Your example should display the “NaN” for at least one missing value.__
type your results and analysis here
[{"airport_code":"ATL","airport_name":"Atlanta, GA: Hartsfield-Jackson Atlanta International","month":"January","year":2005.0,"num_of_flights_total":35048,"num_of_delays_carrier":"1500+","num_of_delays_late_aircraft":-999,"num_of_delays_nas":4598,"num_of_delays_security":10,"num_of_delays_weather":448,"num_of_delays_total":8355,"minutes_delayed_carrier":116423.0,"minutes_delayed_late_aircraft":104415,"minutes_delayed_nas":207467.0,"minutes_delayed_security":297,"minutes_delayed_weather":36931,"minutes_delayed_total":465533}]
Which airport has the worst delays? Describe the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours.
type your results and analysis here
# Task 2: worst airport by proportion & average delay
df2 = df_clean.copy()
# compute total minutes delayed across all categories
delay_min_cols = [
"minutes_delayed_carrier", "minutes_delayed_late_aircraft",
"minutes_delayed_nas", "minutes_delayed_security",
"minutes_delayed_weather"
]
df2["total_delay_mins"] = df2[delay_min_cols].sum(axis=1)
summary = (
df2
.groupby("airport_code")
.agg(
total_flights = ("num_of_flights_total", "sum"),
delayed_flights = ("num_of_delays_total", "sum"),
total_delay_mins = ("total_delay_mins", "sum"),
)
.assign(
prop_delayed = lambda d: d["delayed_flights"] / d["total_flights"],
avg_delay_hours = lambda d: (d["total_delay_mins"] / d["delayed_flights"]) / 60
)
.reset_index()
)
print(summary.sort_values("prop_delayed", ascending=False)) airport_code total_flights delayed_flights total_delay_mins \
5 SFO 1630945 425604 25853438.0
3 ORD 3597588 830825 54598084.0
0 ATL 4430047 902443 52769237.0
2 IAD 851571 168467 9862739.0
4 SAN 917862 175132 8043504.0
1 DEN 2513974 468519 24562692.0
6 SLC 1403384 205160 9822243.0
prop_delayed avg_delay_hours
5 0.260955 1.012421
3 0.230939 1.095258
0 0.203710 0.974563
2 0.197831 0.975734
4 0.190804 0.765471
1 0.186366 0.873771
6 0.146189 0.797934
What is the best month to fly if you want to avoid delays of any length? Describe the metric you chose and why you chose it to calculate your answer. Include one chart to help support your answer, with the x-axis ordered by month. (To answer this question, you will need to remove any rows that are missing the Month variable.)
type your results and analysis here
df3 = df_clean.dropna(subset=["month"]).copy()
month_summary = (
df3
.groupby("month")
.agg(
total_flights = ("num_of_flights_total", "sum"),
delayed_flights = ("num_of_delays_total", "sum"),
)
.assign(
prop_delayed = lambda d: d["delayed_flights"] / d["total_flights"]
)
.reset_index()
)
month_order = [
"January","February","March","April","May","June",
"July","August","September","October","November","December"
]
month_summary["month"] = pd.Categorical(
month_summary["month"],
categories=month_order,
ordered=True
)
best_month = month_summary.loc[month_summary["prop_delayed"].idxmin(), "month"]
print(f"Best month to fly: {best_month}")
from lets_plot import ggplot, aes, geom_bar, labs
p3 = (
ggplot(month_summary, aes(x="month", y="prop_delayed"))
+ geom_bar(stat="identity")
+ labs(
x="Month",
y="Proportion of Flights Delayed",
title="Flight Delay Proportion by Month"
)
)
p3Best month to fly: September
According to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories. Your job is to create a new column that calculates the total number of flights delayed by weather (both severe and mild). You will need to replace all the missing values in the Late Aircraft variable with the mean. Show your work by printing the first 5 rows of data in a table. Use these three rules for your calculations:
a. 100% of delayed flights in the Weather category are due to weather
a. 30% of all delayed flights in the Late-Arriving category are due to weather
a. From April to August, 40% of delayed flights in the NAS category are due to weather. The rest of the months, the proportion rises to 65%
type your results and analysis here
# Task 4: Compute total number of flights delayed by weather (severe + mild)
# 1) Copy and parse month names to numbers
df4 = df_clean.copy()
df4["month_num"] = pd.to_datetime(
df4["month"], format="%B", errors="coerce"
).dt.month
# 2) Convert delay-count columns to numeric, fill late-aircraft missing with mean
for col in [
"num_of_delays_late_aircraft",
"num_of_delays_nas",
"num_of_delays_weather"
]:
df4[col] = pd.to_numeric(df4[col], errors="coerce")
mean_late = df4["num_of_delays_late_aircraft"].mean()
df4["num_of_delays_late_aircraft"] = df4["num_of_delays_late_aircraft"].fillna(mean_late)
df4["num_of_delays_nas"] = df4["num_of_delays_nas"].fillna(0)
df4["num_of_delays_weather"] = df4["num_of_delays_weather"].fillna(0)
# 3) Vectorized proportion for NAS
nas_prop = np.where(df4["month_num"].between(4, 8), 0.40, 0.65)
# 4) Compute total weather delays
df4["weather_delay_count"] = (
df4["num_of_delays_weather"] # 100% severe
+ df4["num_of_delays_late_aircraft"] * 0.30 # 30% late‐aircraft
+ df4["num_of_delays_nas"] * nas_prop # NAS proportion
)
# 5) Show first 5 rows
df4.loc[:, [
"airport_code", "month", "month_num",
"num_of_delays_weather",
"num_of_delays_late_aircraft",
"num_of_delays_nas",
"weather_delay_count"
]].head()| airport_code | month | month_num | num_of_delays_weather | num_of_delays_late_aircraft | num_of_delays_nas | weather_delay_count | |
|---|---|---|---|---|---|---|---|
| 0 | ATL | January | 1.0 | 448 | -999 | 4598 | 3137.00 |
| 1 | DEN | January | 1.0 | 233 | 928 | 935 | 1119.15 |
| 2 | IAD | January | 1.0 | 61 | 1058 | 895 | 960.15 |
| 3 | ORD | January | 1.0 | 306 | 2255 | 5415 | 4502.25 |
| 4 | SAN | January | 1.0 | 56 | 680 | 638 | 674.70 |
Using the new weather variable calculated above, create a barplot showing the proportion of all flights that are delayed by weather at each airport. Describe what you learn from this graph.
type your results and analysis here
# Task 5: Barplot of proportion of all flights delayed by weather per airport
# 1) Flag rows with any weather delays
df5 = df4.copy()
df5["weather_flag"] = df5["weather_delay_count"] > 0
# 2) Aggregate by airport
plot_data = (
df5
.groupby("airport_code")
.agg(
total_flights = ("num_of_flights_total", "sum"),
total_weather_delays = ("weather_delay_count", "sum")
)
.assign(
prop_weather_delay = lambda d: d["total_weather_delays"] / d["total_flights"]
)
.reset_index()
)
# 3) Plot
from lets_plot import ggplot, aes, geom_bar, labs
p5 = (
ggplot(plot_data, aes(x="airport_code", y="prop_weather_delay"))
+ geom_bar(stat="identity")
+ labs(
x="Airport",
y="Proportion of Flights Delayed by Weather",
title="Weather-Related Delay Proportion by Airport"
)
)
p5Which delay is the worst delay? Create a similar analysis as above for Weahter Delay with: Carrier Delay and Security Delay. Compare the proportion of delay for each of the three categories in a Chart and a Table. Describe your results.
type your results and analysis here